Air Quality Project Proposal

Author

Ng Wei Herng 2302854

Published

July 7, 2024

These libraries will be required:

library(tidyverse)
Warning: package 'dplyr' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Warning: package 'readxl' was built under R version 4.4.1
library(readr)
library(dplyr)
library(tidyr)

1 Original Data Visualization in News Media

The visualization titled “Comparing PM2.5 Concentrations in Capital Cities” created by Pallavi Rao (2023) and published on “The Visual Capitalist”, presents a snapshot of PM2.5 air pollution levels in various capital cities around the world for the year 2022. PM2.5 refers to particulate matter that is less than 2.5 micrometers in diameter, which is small enough to penetrate the lungs and enter the bloodstream, posing significant health risks.

This visualization uses a series of red circles to represent the PM2.5 concentrations in each capital city. The number of circles corresponds to the level of PM2.5 concentration that exceeds the World Health Organization’s (WHO) safe limit for PM2.5, which is 5 µg/m³. Any value above this indicates a higher risk for adverse health effects.

Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)

Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)

2 Critical Assessment of the Original Visualization

The original visualization made good use of color as the choice of red for the circles emphasizes the health hazard posed by high PM2.5 concentrations. It provided a clear legend and chose a consistent format for its presentation, which made sense for its purpose as a comparative visualization. The data was presented in a ranked manner, making it easier to follow the trends in air quality across different cities.

However, the visualization is overall cluttered and does not scale well, making it tedious when trying to compare or see the trend between a larger number of cities. There are several areas where the visualization could be improved to enhance its effectiveness and user experience:

  1. Overcrowding in High PM2.5 Cities: For cities with extremely high PM2.5 levels, the visualization becomes overcrowded with red circles, making it difficult to discern individual data points while also diminishing the impact of the difference in scale between such cities.
  2. Static Year Selection: The visualization is limited to 2022 data. Including multiple years would provide a more dynamic and comprehensive view of trends over time.
  3. No Regional Differentiation: While each city is labeled, there is no clear regional differentiation which could be useful in understanding broader regional trends and patterns.
  4. Lack of Depth: There are no interactive elements like info-tips or toggle buttons that allow users to explore the data in more depth or switch views between different time periods or concentration ranges.
  5. No Contextual Information: The visualization lacks additional context such as population density, industrial activity, or geographical features that could help explain the varying air quality levels in each city.

3 Proposed Improvements

  1. Alternative Visualization Type: An alternative visual representations such as a color gradient or bar chart could prevent overcrowding and improve clarity.
  2. Expansion on Available Data: Including options to view historical data from before 2022 will help in trend studies regarding the past history of air quality for the country or region.
  3. Geographical Grouping: Grouping the cities by region or continent could make the data more accessible and relevant for viewers interested in specific areas.
  4. Interactive Features: Instead of looking through the whole list of countries or regions to find a specific data point, a filter will allow users to zoom in on specific cities, view time-series data, and explore the sources and effects of pollution.
  5. Inclusion of Contextual Information: Including other relevant data such as population or growth domestic product (GDP) alongside PM2.5 levels could offer insights on the potential cause or impact when air quality rises or dips.
  6. Improved color coding: The color representation of the data can be improved by adding different colors and their corresponding gradients to allow for more distinct data representation (e.g., using green to signify good air quality, yellow to signify moderate air quality and red to signify bad air quality).

4 Data Cleaning

4.1 Data Source Summary

The original data set used for the visualization was sourced from the IQAir World Air Quality Report (2022). While IQAir provides an API for users to obtain their air quality data, the API only gives real-time data instead of data in a time series ranging from the past to recent years.

As such, the data set we have chosen to use comes from the World Health Organization (WHO) which provides data on air quality for various countries (2022). The data set contains information on PM2.5 concentrations for different countries and years. Alongside this data set, we have also chosen to use 3 additional data set for the purposes of enhancing the visualization, as well as to improve on the data engineering and data cleaning aspect of the WHO data set.

The additional data sets are:

  1. Country Codes (2024): This data set contains the alpha-3 country codes and sub-region category for each country, which will be used to merge with the WHO data set so that we may obtain a more accurate mean for missing PM2.5 values based on sub-region.
  2. Population Data (2024): This data set contains the population for each country, which will be used to provide context to the PM2.5 data within our visualization.
  3. GDP Data (2024): This data set contains the GDP for each country, which will be used to provide context to the PM2.5 data within our visualization.

Below is a glimpse and summary of the WHO air quality data set.

who_data <- read_excel("who_aap_2021_v9_11august2022.xlsx", sheet = "AAP_2022_city_v9")

regional_data <- read.csv("all.csv")
glimpse(who_data)
Rows: 32,191
Columns: 15
$ `WHO Region`                             <chr> "Eastern Mediterranean Region…
$ ISO3                                     <chr> "AFG", "ALB", "ALB", "ALB", "…
$ `WHO Country Name`                       <chr> "Afghanistan", "Albania", "Al…
$ `City or Locality`                       <chr> "Kabul", "Durres", "Durres", …
$ `Measurement Year`                       <dbl> 2019, 2015, 2016, 2015, 2016,…
$ `PM2.5 (μg/m3)`                          <dbl> 119.77, NA, 14.32, NA, NA, NA…
$ `PM10 (μg/m3)`                           <dbl> NA, 17.65, 24.56, NA, NA, NA,…
$ `NO2 (μg/m3)`                            <dbl> NA, 26.63, 24.78, 23.96, 26.2…
$ `PM25 temporal coverage (%)`             <dbl> 18, NA, NA, NA, NA, NA, NA, N…
$ `PM10 temporal coverage (%)`             <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ `NO2 temporal coverage (%)`              <dbl> NA, 83.96119, 87.93260, 97.85…
$ Reference                                <chr> "U.S. Department of State, Un…
$ `Number and type of monitoring stations` <chr> "NA", "NA", "NA", "NA", "NA",…
$ `Version of the database`                <dbl> 2022, 2022, 2022, 2022, 2022,…
$ Status                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N…
summary(who_data)
  WHO Region            ISO3           WHO Country Name   City or Locality  
 Length:32191       Length:32191       Length:32191       Length:32191      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Measurement Year PM2.5 (μg/m3)     PM10 (μg/m3)     NO2 (μg/m3)    
 Min.   :2000     Min.   :  0.01   Min.   :  1.04   Min.   :  0.00  
 1st Qu.:2014     1st Qu.: 10.35   1st Qu.: 16.98   1st Qu.: 12.00  
 Median :2016     Median : 16.00   Median : 22.00   Median : 18.80  
 Mean   :2016     Mean   : 22.92   Mean   : 30.53   Mean   : 20.62  
 3rd Qu.:2018     3rd Qu.: 31.00   3rd Qu.: 31.30   3rd Qu.: 27.16  
 Max.   :2021     Max.   :191.90   Max.   :540.00   Max.   :210.68  
                  NA's   :17143    NA's   :11082    NA's   :9991    
 PM25 temporal coverage (%) PM10 temporal coverage (%)
 Min.   :  0.00             Min.   :  2.568           
 1st Qu.: 88.60             1st Qu.: 87.945           
 Median : 97.00             Median : 96.039           
 Mean   : 90.79             Mean   : 90.583           
 3rd Qu.: 99.00             3rd Qu.: 98.938           
 Max.   :100.00             Max.   :100.000           
 NA's   :24916              NA's   :26810             
 NO2 temporal coverage (%)  Reference        
 Min.   :  1.923           Length:32191      
 1st Qu.: 93.208           Class :character  
 Median : 96.370           Mode  :character  
 Mean   : 93.697                             
 3rd Qu.: 98.927                             
 Max.   :100.000                             
 NA's   :12301                               
 Number and type of monitoring stations Version of the database  Status       
 Length:32191                           Min.   :2016            Mode:logical  
 Class :character                       1st Qu.:2022            NA's:32191    
 Mode  :character                       Median :2022                          
                                        Mean   :2022                          
                                        3rd Qu.:2022                          
                                        Max.   :2022                          
                                                                              

4.2 Handling of Missing Values

Based on the above summaries, we can see that there are missing values in the data set in the PM columns. We will need to handle these missing values before proceeding with the changes. Some methods we can use to handle missing values include:

  1. Dropping Missing Values: We can drop rows with missing values if they are not significant in number.
  2. Imputation: We can impute missing values with the mean, median, or mode of the column.

We will impute missing values by merging the WHO data set with the country codes data set to obtain the sub-region category for each country. We will then calculate the mean PM2.5 concentration for each sub-region and impute the missing values with the mean PM2.5 concentration of the corresponding sub-region.

# Merge the data
merged_data <- who_data %>%
  left_join(regional_data %>% select(alpha.3, sub.region), by = c("ISO3" = "alpha.3")) %>%
  select(1:6, sub.region)

head(merged_data)
# A tibble: 6 × 7
  `WHO Region`    ISO3  `WHO Country Name` `City or Locality` `Measurement Year`
  <chr>           <chr> <chr>              <chr>                           <dbl>
1 Eastern Medite… AFG   Afghanistan        Kabul                            2019
2 European Region ALB   Albania            Durres                           2015
3 European Region ALB   Albania            Durres                           2016
4 European Region ALB   Albania            Elbasan                          2015
5 European Region ALB   Albania            Elbasan                          2016
6 European Region ALB   Albania            Elbasan                          2017
# ℹ 2 more variables: `PM2.5 (μg/m3)` <dbl>, sub.region <chr>
# Function to impute missing PM2.5 values
impute_pm25 <- function(data) {
  data %>%
    group_by(sub.region, `Measurement Year`) %>%
    mutate(pm25_mean = mean(`PM2.5 (μg/m3)`, na.rm = TRUE)) %>%
    mutate(`PM2.5 (μg/m3)` = ifelse(is.na(`PM2.5 (μg/m3)`), pm25_mean, `PM2.5 (μg/m3)`)) %>%
    select(-pm25_mean)
}

imputed_data <- impute_pm25(merged_data)

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub.region, Measurement Year [4]
  `WHO Region`    ISO3  `WHO Country Name` `City or Locality` `Measurement Year`
  <chr>           <chr> <chr>              <chr>                           <dbl>
1 Eastern Medite… AFG   Afghanistan        Kabul                            2019
2 European Region ALB   Albania            Durres                           2015
3 European Region ALB   Albania            Durres                           2016
4 European Region ALB   Albania            Elbasan                          2015
5 European Region ALB   Albania            Elbasan                          2016
6 European Region ALB   Albania            Elbasan                          2017
# ℹ 2 more variables: `PM2.5 (μg/m3)` <dbl>, sub.region <chr>

4.3 Normalizing Column Names

We will also need to normalize the column names to ensure consistency and ease of access. This will involve converting all column names to lowercase, replacing spaces with underscores, and removing special characters.

# Standardize column names
colnames(imputed_data) <- tolower(c("WHO_Region", "ISO3", "WHO_Country_Name", "City_or_Locality", 
                                    "Measurement_Year", "PM2_5", "sub_region"))

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub_region, measurement_year [4]
  who_region      iso3  who_country_name city_or_locality measurement_year pm2_5
  <chr>           <chr> <chr>            <chr>                       <dbl> <dbl>
1 Eastern Medite… AFG   Afghanistan      Kabul                        2019 120. 
2 European Region ALB   Albania          Durres                       2015  15.5
3 European Region ALB   Albania          Durres                       2016  14.3
4 European Region ALB   Albania          Elbasan                      2015  15.5
5 European Region ALB   Albania          Elbasan                      2016  14.2
6 European Region ALB   Albania          Elbasan                      2017  15.2
# ℹ 1 more variable: sub_region <chr>

4.4 Data Type Conversion

We will convert the data types of the columns to their appropriate types to be safe.

# Convert data types of columns
imputed_data <- imputed_data %>%
  mutate(
    who_region = as.factor(who_region),
    iso3 = as.factor(iso3),
    who_country_name = as.factor(who_country_name),
    city_or_locality = as.factor(city_or_locality),
    measurement_year = as.numeric(measurement_year),
    pm2_5 = as.numeric(pm2_5),
    sub_region = as.factor(sub_region)
  )

str(imputed_data)
gropd_df [32,191 × 7] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ who_region      : Factor w/ 6 levels "Western Pacific Region",..: 4 2 2 2 2 2 2 2 2 2 ...
 $ iso3            : Factor w/ 118 levels "AUS","NZL","KAZ",..: 73 86 86 86 86 86 86 86 86 86 ...
 $ who_country_name: Factor w/ 118 levels "Australia","New Zealand",..: 73 86 86 86 86 86 86 86 86 86 ...
 $ city_or_locality: Factor w/ 6874 levels "Albany","Bunbury",..: 4014 5137 5137 5138 5138 5138 5143 5143 5123 5123 ...
 $ measurement_year: num [1:32191] 2019 2015 2016 2015 2016 ...
 $ pm2_5           : num [1:32191] 119.8 15.5 14.3 15.5 14.2 ...
 $ sub_region      : Factor w/ 15 levels "Australia and New Zealand",..: 11 12 12 12 12 12 12 12 12 12 ...
 - attr(*, "groups")= tibble [149 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ sub_region      : Factor w/ 15 levels "Australia and New Zealand",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ measurement_year: num [1:149] 2010 2011 2012 2013 2014 ...
  ..$ .rows           : list<int> [1:149] 
  .. ..$ : int [1:9] 71 96 103 111 119 148 165 190 229
  .. ..$ : int [1:9] 72 97 104 112 120 149 166 191 230
  .. ..$ : int [1:52] 63 73 98 105 113 121 150 167 192 219 ...
  .. ..$ : int [1:27] 64 74 78 86 99 106 114 122 127 139 ...
  .. ..$ : int [1:68] 65 75 79 84 87 92 100 107 115 117 ...
  .. ..$ : int [1:67] 66 76 80 85 88 93 101 108 124 130 ...
  .. ..$ : int [1:68] 67 77 81 89 94 102 109 125 132 133 ...
  .. ..$ : int [1:19] 68 134 226 239 282 26072 26078 26082 26087 26100 ...
  .. ..$ : int [1:5] 69 135 227 240 283
  .. ..$ : int [1:69] 70 82 83 90 91 95 110 116 118 126 ...
  .. ..$ : int [1:2] 24512 32011
  .. ..$ : int [1:5] 24513 24516 29615 29616 32012
  .. ..$ : int [1:33] 24448 24450 24461 24462 24466 24467 24468 24473 24478 24482 ...
  .. ..$ : int [1:25] 5020 5705 7334 24449 24451 24469 24474 24494 24518 24530 ...
  .. ..$ : int [1:37] 5021 5216 5706 6236 7335 24452 24455 24458 24463 24470 ...
  .. ..$ : int [1:27] 5022 5067 5217 5707 5713 5976 6207 7336 7417 7962 ...
  .. ..$ : int [1:255] 4900 4911 4960 4967 4978 5007 5023 5050 5073 5096 ...
  .. ..$ : int [1:563] 4852 4858 4866 4868 4872 4878 4886 4898 4901 4906 ...
  .. ..$ : int [1:568] 4853 4859 4867 4869 4873 4879 4887 4899 4902 4905 ...
  .. ..$ : int [1:388] 4846 4862 4876 4880 4882 4890 4896 4903 4910 4914 ...
  .. ..$ : int [1:1332] 4847 4848 4850 4854 4856 4860 4863 4864 4870 4874 ...
  .. ..$ : int [1:990] 4849 4851 4855 4857 4861 4865 4871 4875 4885 4889 ...
  .. ..$ : int [1:17] 24528 24539 24550 24562 24574 24586 24598 24610 24621 24632 ...
  .. ..$ : int [1:397] 1797 1798 1806 1814 1822 1830 1838 1846 1854 1862 ...
  .. ..$ : int 18428
  .. ..$ : int [1:366] 1799 1807 1815 1823 1831 1839 1847 1855 1863 1871 ...
  .. ..$ : int [1:370] 1800 1808 1816 1824 1832 1840 1848 1856 1864 1872 ...
  .. ..$ : int [1:364] 1801 1809 1817 1825 1833 1841 1849 1857 1865 1873 ...
  .. ..$ : int [1:390] 1802 1810 1818 1826 1834 1842 1850 1858 1866 1874 ...
  .. ..$ : int [1:442] 1803 1811 1819 1827 1835 1843 1851 1859 1867 1875 ...
  .. ..$ : int [1:457] 1804 1812 1820 1828 1836 1844 1852 1860 1868 1876 ...
  .. ..$ : int [1:456] 1805 1813 1821 1829 1837 1845 1853 1861 1869 1877 ...
  .. ..$ : int [1:91] 2118 2127 2134 2140 2156 2162 2172 2177 2182 2201 ...
  .. ..$ : int [1:115] 2105 2119 2128 2135 2141 2150 2157 2163 2178 2183 ...
  .. ..$ : int [1:129] 2106 2111 2120 2129 2142 2149 2151 2158 2164 2173 ...
  .. ..$ : int [1:146] 2104 2107 2112 2121 2130 2143 2152 2155 2159 2165 ...
  .. ..$ : int [1:169] 57 2103 2108 2113 2122 2131 2136 2144 2160 2166 ...
  .. ..$ : int [1:161] 58 2114 2123 2132 2137 2145 2161 2167 2176 2187 ...
  .. ..$ : int [1:124] 59 2109 2115 2124 2146 2168 2188 2198 2213 2219 ...
  .. ..$ : int [1:39] 60 8832 8915 12367 12380 25014 25029 25038 25041 25047 ...
  .. ..$ : int [1:118] 61 2058 2116 2125 2138 2147 2153 2169 2199 2205 ...
  .. ..$ : int [1:102] 62 2059 2110 2117 2126 2133 2139 2148 2154 2170 ...
  .. ..$ : int 2060
  .. ..$ : int 15185
  .. ..$ : int 15186
  .. ..$ : int [1:4] 29625 29626 29627 29628
  .. ..$ : int [1:2] 24991 24993
  .. ..$ : int 24985
  .. ..$ : int [1:5] 12383 12386 24974 24981 24995
  .. ..$ : int [1:2] 12384 12387
  .. ..$ : int [1:2] 12385 12388
  .. ..$ : int [1:6] 24972 24975 24982 24984 24986 24994
  .. ..$ : int [1:4] 24973 24976 24979 24987
  .. ..$ : int [1:4] 12358 24977 24980 24988
  .. ..$ : int [1:5] 24978 24983 24989 24990 24992
  .. ..$ : int [1:184] 30236 30237 30248 30259 30269 30271 30281 30284 30292 30298 ...
  .. ..$ : int [1:181] 30235 30238 30249 30260 30272 30282 30285 30299 30309 30317 ...
  .. ..$ : int [1:180] 30239 30250 30261 30273 30286 30300 30310 30322 30335 30344 ...
  .. ..$ : int [1:311] 2605 2626 2632 2637 2643 2649 2655 2661 2667 2673 ...
  .. ..$ : int [1:178] 30241 30252 30263 30275 30287 30294 30302 30312 30324 30337 ...
  .. ..$ : int [1:318] 2606 2611 2618 2627 2633 2638 2644 2650 2656 2662 ...
  .. ..$ : int [1:342] 2607 2612 2619 2623 2628 2634 2639 2645 2651 2657 ...
  .. ..$ : int [1:368] 2608 2613 2620 2624 2629 2635 2640 2646 2652 2658 ...
  .. ..$ : int [1:360] 2603 2609 2614 2616 2621 2630 2641 2647 2653 2659 ...
  .. ..$ : int [1:340] 2604 2610 2615 2617 2622 2625 2631 2636 2642 2648 ...
  .. ..$ : int [1:186] 12296 12316 12324 12336 12337 14886 14895 14903 14911 14919 ...
  .. ..$ : int [1:219] 12297 12304 12317 12325 12338 12345 12352 14887 14896 14904 ...
  .. ..$ : int [1:229] 12298 12305 12318 12326 12339 12346 14888 14897 14905 14913 ...
  .. ..$ : int [1:250] 12299 12306 12311 12319 12327 12340 12347 12353 14889 14898 ...
  .. ..$ : int [1:266] 12300 12307 12312 12320 12328 12332 12341 12348 12354 14890 ...
  .. ..$ : int [1:271] 12301 12308 12313 12321 12329 12333 12342 12349 12355 14891 ...
  .. ..$ : int [1:279] 12302 12309 12314 12322 12330 12334 12343 12350 12356 14892 ...
  .. ..$ : int [1:293] 12303 12310 12315 12323 12331 12335 12344 12351 12357 14893 ...
  .. ..$ : int [1:2] 25389 28748
  .. ..$ : int [1:4] 18525 28749 32016 32023
  .. ..$ : int [1:17] 25316 25318 25319 25326 25388 26317 26322 26326 26334 26349 ...
  .. ..$ : int [1:27] 25314 25315 25317 25324 25325 25328 26318 26323 26327 26335 ...
  .. ..$ : int [1:53] 18514 25323 25329 25384 25391 25392 25398 25404 25405 26328 ...
  .. ..$ : int [1:35] 18517 18521 25320 25330 26319 26324 26329 26333 26337 26343 ...
  .. ..$ : int [1:52] 18515 18518 18522 25321 25331 26330 26338 26344 26348 26353 ...
  .. ..$ : int [1:45] 18519 18523 25322 25332 26320 26331 26339 26345 26354 26361 ...
  .. ..$ : int [1:17] 18520 18524 26321 26325 26332 26340 26346 26355 26362 26368 ...
  .. ..$ : int [1:56] 18516 24771 25327 25385 25386 25387 25390 25393 25394 25395 ...
  .. ..$ : int 28758
  .. ..$ : int [1:184] 1743 1753 1769 1783 2597 18535 18544 18553 18570 18584 ...
  .. ..$ : int [1:212] 1744 1754 1770 1784 2586 2589 2594 2598 18536 18545 ...
  .. ..$ : int [1:231] 1745 1755 1771 1785 2587 2590 2595 2599 18537 18546 ...
  .. ..$ : int [1:255] 1737 1746 1756 1763 1772 1777 1786 1791 2584 2588 ...
  .. ..$ : int [1:247] 1738 1747 1757 1764 1773 1778 1787 1792 2585 2592 ...
  .. ..$ : int [1:11] 1739 1748 1758 1765 1774 1779 1788 1793 2593 2602 ...
  .. ..$ : int [1:318] 1740 1749 1759 1766 1780 1794 18526 18540 18549 18558 ...
  .. ..$ : int [1:321] 1741 1750 1751 1760 1767 1775 1781 1789 1795 18527 ...
  .. ..$ : int [1:386] 1742 1752 1761 1768 1776 1782 1790 1796 18528 18533 ...
  .. ..$ : int [1:400] 1 1762 18529 18534 18543 18552 18561 18569 18578 18581 ...
  .. ..$ : int [1:24] 20899 20907 20914 20921 20936 20948 20960 20966 20977 20985 ...
  .. ..$ : int [1:817] 2081 2089 12396 12397 12405 12422 12431 12440 12448 12456 ...
  .. ..$ : int [1:9] 12413 12432 12495 12749 13003 13039 14648 14706 14871
  .. ..$ : int [1:48] 14 12414 12423 12433 12464 12469 12496 12561 12700 12709 ...
  .. ..$ : int [1:705] 12389 12398 12406 12415 12424 12434 12441 12449 12457 12465 ...
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

4.5 Removing Duplicates

We will check for and remove any duplicate rows in the dataset to ensure data integrity.

initial_row_count <- nrow(imputed_data)

imputed_data <- imputed_data %>%
distinct()

# Check if any duplicates were removed
final_row_count <- nrow(imputed_data)
duplicates_removed <- initial_row_count - final_row_count

if (duplicates_removed > 0) {
  message(duplicates_removed, " duplicate rows were found and removed.")
} else {
  message("No duplicate rows were found.")
}
141 duplicate rows were found and removed.

4.6 Population Data & GDP Data

The steps to clean the GDP data is the same as the steps to clean the population data.

To clean the population data, we need to remove the metadata rows, reshape the data to have their respective year and population column so that we can merge it with the WHO data set.

# Read in the dataset and skip the first 4 rows
population_data <- read_csv("API_SP.POP.TOTL_DS2_en_csv_v2_580248.csv", skip = 4)
New names:
Rows: 266 Columns: 69
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (64): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (1): ...69
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...69`
# Gather the year columns into key-value pairs of year and population
population_data <- population_data %>%
  gather(key = "year", value = "population", -`Country Name`, -`Country Code`, -`Indicator Name`, -`Indicator Code`) %>%
  mutate(year = as.integer(year))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `year = as.integer(year)`.
Caused by warning:
! NAs introduced by coercion
# Rename columns for clarity
colnames(population_data) <- c("country_name", "country_code", "indicator_name", "indicator_code", "year", "population")

write_csv(population_data, "cleaned_population_data.csv")

head(population_data)
# A tibble: 6 × 6
  country_name       country_code indicator_name indicator_code  year population
  <chr>              <chr>        <chr>          <chr>          <int>      <dbl>
1 Aruba              ABW          Population, t… SP.POP.TOTL     1960      54608
2 Africa Eastern an… AFE          Population, t… SP.POP.TOTL     1960  130692579
3 Afghanistan        AFG          Population, t… SP.POP.TOTL     1960    8622466
4 Africa Western an… AFW          Population, t… SP.POP.TOTL     1960   97256290
5 Angola             AGO          Population, t… SP.POP.TOTL     1960    5357195
6 Albania            ALB          Population, t… SP.POP.TOTL     1960    1608800

4.7 Cleaned Data

This is the cleaned data

write_csv(imputed_data, "cleaned_imputed_data.csv")

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub_region, measurement_year [4]
  who_region      iso3  who_country_name city_or_locality measurement_year pm2_5
  <fct>           <fct> <fct>            <fct>                       <dbl> <dbl>
1 Eastern Medite… AFG   Afghanistan      Kabul                        2019 120. 
2 European Region ALB   Albania          Durres                       2015  15.5
3 European Region ALB   Albania          Durres                       2016  14.3
4 European Region ALB   Albania          Elbasan                      2015  15.5
5 European Region ALB   Albania          Elbasan                      2016  14.2
6 European Region ALB   Albania          Elbasan                      2017  15.2
# ℹ 1 more variable: sub_region <fct>

5 Conclusion

The data has now been cleaned and is ready for visualization, we will be using ggplot2 to create the visualizations and ggplotly to render the plot interactive. The proposed improvements will be implemented to enhance the clarity and depth of the visualization, providing a more interactive and informative experience for users. By incorporating these enhancements, we aim to create a more engaging and insightful visualization that effectively communicates the trends in air quality across countries globally.

6 References

  1. Rao, P. (2024, January 6). Visualized: Air quality and pollution in 50 capital cities. Visual Capitalist
  2. Air quality database 2022. (2024, June 20)
  3. ISO-3166-Countries-with-Regional-Codes. (2024, June 19)
  4. World Bank Group Population Data. (2024)
  5. World Bank Group GDP Data. (2024)